Abstract
Background and Aims Acute Myeloid Leukemia (AML) presents significant diagnostic and therapeutic challenges due to rapidly evolving genomic risk classifications and the emergence of novel treatment options. While large language models (LLMs) such as ChatGPT show promise in medical domains, they frequently deliver inaccurate or vague answers to complex hematologic questions. Even advanced reasoning models that can outperform physicians on general medical tasks remain untested in AML-specific scenarios. To address this gap, we first evaluated leading LLMs (ChatGPT, Claude, DeepSeek) on real-world AML cases and found suboptimal performance. We therefore created a novel architecture of expert AI agents; an agent is an AI assistant that is trained on specific domain data and can perform specific tasks. Working together like a virtual tumor board, these agents collaborate as human specialists do, discussing and refining treatment recommendations as a cohesive team of digital experts.
Methods The virtual AML panel (VAP) comprises five specialized AI agents: a moderator that guides discussion and synthesizes final recommendations; a case processor that structures unstructured case data; a pathology agent trained exclusively on WHO classification criteria; a prognostication agent trained on the ELN 2022 and 2024 risk classifications; and a therapy recommendation agent trained on NCCN, ELN, and other guidelines.
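The panel structure above can be sketched as a simple pipeline. This is a hypothetical illustration only; the agent class, knowledge-base strings, and hand-off order are assumptions, not the authors' implementation, and a real system would ground each agent's response in an LLM call.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """One panel member, grounded on a domain-specific knowledge base."""
    name: str
    knowledge_base: str  # e.g. "WHO classification", "ELN 2022/2024"

    def respond(self, case: dict) -> str:
        # Placeholder: a real agent would query an LLM constrained
        # to its knowledge base rather than return a stub string.
        return f"{self.name} opinion grounded in {self.knowledge_base}"

def run_panel(raw_case: str) -> dict:
    """Hypothetical VAP flow: structure the case, collect specialist
    opinions, then let the moderator synthesize a recommendation."""
    specialists = [
        Agent("pathology", "WHO classification"),
        Agent("prognostication", "ELN 2022/2024 risk classification"),
        Agent("therapy", "NCCN/ELN treatment guidelines"),
    ]
    moderator = Agent("moderator", "panel synthesis rules")

    # The case processor turns free-text notes into structured fields.
    structured = {"case_text": raw_case}
    opinions = {a.name: a.respond(structured) for a in specialists}
    summary = moderator.respond(structured)
    return {"opinions": opinions, "summary": summary}
```

In this sketch the moderator sees the structured case and each specialist's output; iterative discussion rounds between agents would be the natural extension of the single pass shown here.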
A total of 900 domain-level evaluations were performed across 20 complex, real-world AML cases. Responses from the VAP were compared to those generated by leading AI tools: ChatGPT (GPT-4o and GPT-o3), Claude, and DeepSeek. Three board-certified leukemia experts scored responses on a 1–5 Likert scale across three domains (diagnosis, prognosis, and treatment). A score of ≥ 4 was considered correct per expert opinion. Responses were also assessed for factual accuracy and categorized as containing no, minor, or major errors.
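As a back-of-envelope check, one plausible decomposition of the 900 evaluations (the abstract does not state the breakdown explicitly) is 20 cases × 3 domains × 3 expert raters × 5 systems (VAP, GPT-4o, GPT-o3, Claude, DeepSeek), together with the stated correctness threshold on the Likert scale:

```python
# Assumed decomposition of the evaluation design; the factors below are
# inferred from the abstract, not stated as a formula by the authors.
cases, domains, experts, systems = 20, 3, 3, 5
total_evaluations = cases * domains * experts * systems  # = 900

def is_correct(likert_score: int) -> bool:
    """Correctness criterion from the abstract: Likert score >= 4."""
    return likert_score >= 4
```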
Results 93% of the VAP responses were correct per expert opinion, compared to 61% for GPT-4o, 52% for GPT-o3, 48% for Claude, and 41% for DeepSeek. Across individual domains, VAP was considered correct in 98% of diagnosis, 85% of prognosis, and 95% of treatment recommendations, significantly outperforming GPT-4o (73%, 53%, and 58%), GPT-o3 (70%, 38%, and 48%), Claude (60%, 48%, and 35%), and DeepSeek (40%, 50%, and 33%), respectively.
On a 1–5 scale, VAP scored highest across all evaluated domains (overall 4.7; 4.9 / 4.6 / 4.7 for Diagnosis / Prognosis / Treatment), followed by GPT-4o (3.8; 4.1 / 3.5 / 3.7), GPT-o3 (3.5; 4.1 / 3.1 / 3.3), Claude (3.3; 3.5 / 3.2 / 3.3), and DeepSeek (3.2; 3.1 / 3.5 / 3.0).
The VAP demonstrated high factual reliability, with no major errors, 88.9% fully accurate responses, and only minor issues in the remainder. GPT-4o was the next best model, with major errors in 35% of cases; however, only 25% of its cases were considered error free. GPT-o3, Claude, and DeepSeek all had major errors in more than half of cases (70%, 70%, and 63%, respectively).
Conclusion We developed an AML virtual tumor board with multiple AI agents trained on domain-specific data to generate accurate, guideline-driven responses at an expert level. This system is designed to support hematologists in clinical decision-making for the diagnosis and treatment of patients with AML. Our work represents a step forward in integrating AI into precision medicine and hematology.